SGLang Inference Backend Configuration Guide
SGLang is a fast and easy-to-use inference engine, particularly well suited to large language model inference. This document describes how to configure and use the SGLang inference backend in the ROLL framework.
SGLang Introduction
SGLang is a structured generation language designed for large language model inference. It provides efficient inference performance and a flexible programming interface.
Configuring SGLang Strategy
In the ROLL framework, the SGLang inference strategy is configured by setting strategy_args in the YAML configuration file.
Basic Configuration Example
The following is a typical SGLang configuration example (from examples/qwen3-30BA3B-rlvr_megatron/rlvr_config_sglang.yaml):
actor_infer:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
  generating_args:
    max_new_tokens: ${response_length}
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: ${num_return_sequences_in_group}
  strategy_args:
    strategy_name: sglang
    strategy_config:
      mem_fraction_static: 0.7
      load_format: dummy
  num_gpus_per_worker: 2
  device_mapping: list(range(0,24))
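For comparison, the sketch below shows how the same strategy block might look on a single 8-GPU node. The GPU counts are illustrative assumptions rather than values from the example config, and the comment assumes that ROLL splits device_mapping evenly into groups of num_gpus_per_worker GPUs.

actor_infer:
  strategy_args:
    strategy_name: sglang
    strategy_config:
      mem_fraction_static: 0.7
      load_format: dummy
  # Assumed single-node layout: 8 visible GPUs split into groups of 2,
  # which would yield 4 SGLang inference workers.
  num_gpus_per_worker: 2
  device_mapping: list(range(0,8))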
Configuration Parameter Details
- strategy_name: Set to sglang to use the SGLang inference backend.
- strategy_config: SGLang-specific configuration parameters, passed through directly to SGLang. For more SGLang configuration parameters, see the official SGLang documentation; an illustrative sketch with additional pass-through keys follows this list.
  - mem_fraction_static: Fraction of GPU memory reserved for static allocations such as model weights and the KV cache.
    - Increase this value if KV cache building fails.
    - Decrease this value if CUDA memory is insufficient.
  - load_format: Format used to load model weights.
    - Since the model weights are updated at startup (e.g., synchronized from the training worker), this value can be set to dummy so that no real weights need to be loaded from disk.
- num_gpus_per_worker: Number of GPUs allocated per worker.
  - SGLang can use multiple GPUs per worker for parallel inference.
- device_mapping: List of GPU device IDs to use.
- infer_batch_size: Batch size during inference.
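Because strategy_config is forwarded to SGLang unchanged, additional SGLang server arguments can be placed alongside mem_fraction_static and load_format. The sketch below is only an example: context_length, max_running_requests, and disable_radix_cache are standard SGLang server arguments, but they are not part of the ROLL example config, and their exact names and availability should be confirmed against the SGLang version in use.

actor_infer:
  strategy_args:
    strategy_name: sglang
    strategy_config:
      mem_fraction_static: 0.7
      load_format: dummy
      # The keys below are illustrative SGLang server arguments (not taken
      # from the ROLL example config); they are forwarded to SGLang unchanged.
      context_length: 8192          # maximum sequence length SGLang will accept
      max_running_requests: 256     # cap on concurrently scheduled requests
      disable_radix_cache: false    # keep RadixAttention prefix caching enabled
  num_gpus_per_worker: 2
  device_mapping: list(range(0,24))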